Abstract. This paper shows, assuming standard heuristics regarding
the number-field sieve, that a “batch NFS” circuit of area $L^{1.181\ldots+o(1)}$
factors $L^{0.5+o(1)}$ separate $B$-bit RSA keys in time $L^{1.022\ldots+o(1)}$. Here
$L = \exp((\log 2^B)^{1/3}(\log\log 2^B)^{2/3})$. The circuit’s area-time product
(price-performance ratio) is just $L^{1.704\ldots+o(1)}$ per key. For comparison, the
best area-time product known for a single key is $L^{1.976\ldots+o(1)}$.

This paper also introduces new “early-abort” heuristics implying that
“early-abort ECM” improves the performance of batch NFS by a
superpolynomial factor, specifically
$\exp((c + o(1))(\log 2^B)^{1/6}(\log\log 2^B)^{5/6})$
where $c$ is a positive constant.

Keywords: integer factorization, number-field sieve, price-performance 
ratio, batching, smooth numbers, elliptic curves, early aborts 



1 Introduction 

The cryptographic community reached consensus a decade ago that a 1024-bit
RSA key can be broken in a year by an attack machine costing significantly less
than $10^9$ dollars. See [51], [38], [24], and [23]. The attack machine is an
optimized version of the number-field sieve (NFS), a factorization algorithm that
has been intensively studied for twenty years, starting in [36]. The run-time
analysis of NFS relies on various heuristics, but these heuristics have been
confirmed in a broad range of factorization experiments using several independent
NFS software implementations: see, e.g., [29], [30], [31], and [4].

Despite this threat, 1024-bit RSA remains the workhorse of the Internet’s 
“DNS Security Extensions” (DNSSEC). For example, at the time of this writing 
(November 2014), the IP address of the domain dnssec-deployment.org is 

• signed by that domain’s 1024-bit “zone-signing key”, which in turn is 

This work was supported by the National Science Foundation under grant 1018836 
and by the Netherlands Organisation for Scientific Research (NWO) under grant 
639.073.005. Permanent ID of this document: 4f99b1b911984e501c099f514d8fd2ce.
Date: 2014.11.09. 






Daniel J. Bernstein and Tanja Lange 



• signed by that domain’s 2048-bit “key-signing key”, which in turn is

• signed by . org’s 1024-bit zone-signing key, which in turn is 

• signed by . org’s 2048-bit key-signing key, which in turn is 

• signed by the DNS root’s 1024-bit zone-signing key, which in turn is 

• signed by the DNS root’s 2048-bit key-signing key. 

An attacker can forge this IP address by factoring any of the three 1024-bit RSA 
keys in this chain. 

A report [41] last year indicated that, out of the 112 top-level domains using 
DNSSEC, 106 used the same key sizes as .org. We performed our own survey 
of zone-signing keys in September 2014, after many new top-level domains were 
added. We found 286 domains using 1024-bit keys; 4 domains using 1152-bit
keys; 192 domains using 1280-bit keys; and just 22 domains using larger keys. 
Almost all of the 1280-bit keys are for obscure domains such as .boutique and 
.rocks; high-volume domains practically always use 1024-bit keys. 

Evidently DNSSEC users find the attacks against 1024-bit RSA less worrisome 
than the obvious costs of moving to larger keys. There are, according to our 
informal surveys of these users, three widespread beliefs supporting the use of 
1024-bit RSA: 

• A typical RSA key is believed to be worth less than the cost of the attack 
machine. 

• Building the attack machine means building a huge farm of application- 
specific integrated circuits (ASICs). Standard computer clusters costing the 
same amount of money are believed to take much longer to perform the same 
calculations. 

• It is believed that switching RSA signature keys after (e.g.) a month will 
render the attack machine useless, since the attack machine requires a full 
year to run. 

Consider, for example, the following quote from the latest “DNSSEC operational 
practices” recommendations [32, Section 3.4.2], published December 2012: 

DNSSEC signing keys should be large enough to avoid all known crypto- 
graphic attacks during the effectivity period of the key. To date, despite 
huge efforts, no one has broken a regular 1024-bit key; in fact, the best 
completed attack is estimated to be the equivalent of a 700-bit key. An 
attacker breaking a 1024-bit signing key would need to expend phenom- 
enal amounts of networked computing power in a way that would not be 
detected in order to break a single key. Because of this, it is estimated 
that most zones can safely use 1024-bit keys for at least the next ten 
years. 

This quote illustrates the first and third beliefs reported above: the attack cost
would be “phenomenal” and would break only “a single key”; furthermore, the
attack would have to be completed “during the effectivity period of the key”. A
typical DNSSEC key is valid for just one month and is then replaced by a new
key.

1.1. Contents of this paper. This paper analyzes the asymptotic cost,
specifically the price-performance ratio, of breaking many RSA keys. We emphasize
several words here:

• “Many”: The attacker is faced not with a single target, but with many tar- 
gets. The algorithmic task here is not merely to break, e.g., a single 1024- 
bit RSA key; it is to break more than two hundred 1024-bit RSA keys for 
DNSSEC top-level domains, many more 1024-bit RSA keys at lower levels 
of DNSSEC, millions of 1024-bit RSA keys in SSL (as in [25] and [35]; note 
that upgrading SSL to 2048-bit RSA does nothing to protect the confiden- 
tiality of previously recorded SSL traffic), etc. This is important if there are 
ways to share attack work across the keys. 

• “Price-performance ratio”: As in [53], [18], [50], [15], [54], [7], [51], [56], 
[23], [24], etc., our main interest is not in the number of “operations” carried 
out by an algorithm, but in the actual price and performance of a machine 
carrying out those operations. Parallelism increases price but often improves 
performance; large storage arrays are a problem for both price and perfor- 
mance. We use price-performance ratio as our primary cost metric, but we 
also report time separately since signature-key rotation puts a limit upon 
time. 

• “Asymptotic”: The cost improvements that we present are superpolynomial
in the size of the numbers being factored. We thus systematically suppress 
all polynomial factors in our cost analyses, simplifying the analyses. 

This paper presents a new “batch NFS” circuit of area $L^{1.181\ldots+o(1)}$ that,
assuming standard NFS heuristics, factors $L^{0.5+o(1)}$ separate $B$-bit RSA keys
in total time just $L^{1.022\ldots+o(1)}$. The area-time product is $L^{1.704\ldots+o(1)}$
for each key; i.e., the price-performance ratio is $L^{1.704\ldots+o(1)}$. Here (as
usual for NFS) $L$ means $\exp((\log N)^{1/3}(\log\log N)^{2/3})$ where $N = 2^B$.

For comparison (see Table 1.4), the best area-time product known for factoring
a single key (without quantum computers) is $L^{1.976\ldots+o(1)}$, even if non-uniform
precomputations such as Coppersmith’s “factorization factory” are allowed. The
literature is reviewed below.
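To give a feeling for the scale of these exponents, the following Python sketch evaluates $\log_2 L$ for a given key size $B$ and converts an exponent $e$ into $\log_2 L^e$. This is only an illustration: it drops every $o(1)$ term, which the analysis in this paper does not control for any concrete $B$, so the printed numbers are not security estimates. The function names `log2_L` and `log2_cost` are ours.

```python
import math

def log2_L(B):
    """log2 of L = exp((log N)^(1/3) * (log log N)^(2/3)) for N = 2^B.

    Natural logarithms are used inside L, as is usual for NFS.
    """
    ln_N = B * math.log(2)
    ln_L = ln_N ** (1 / 3) * math.log(ln_N) ** (2 / 3)
    return ln_L / math.log(2)

def log2_cost(B, exponent):
    """log2 of L^exponent, with the o(1) term simply dropped (an assumption)."""
    return exponent * log2_L(B)

if __name__ == "__main__":
    for B in (1024, 2048):
        single = log2_cost(B, 1.976)   # best known single-key AT exponent
        batch = log2_cost(B, 1.704)    # batch NFS AT exponent per key
        print(f"B={B}: log2 L = {log2_L(B):.1f}, "
              f"single-key 2^{single:.1f}, batch per key 2^{batch:.1f}")
```

The gap between the two printed exponents grows with $B$, which is what “superpolynomial” means here; the absolute values should not be read as attack costs.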

This paper also looks more closely at the $L^{o(1)}$. The main bottleneck in batch
NFS is not traditional sieving, but rather low-memory smoothness detection,
motivating new attention to the complexity of low-memory smoothness detection.
Traditional ECM, the elliptic-curve method of recognizing $y$-smooth integers,
works in low memory and takes time $\exp(\sqrt{(2 + o(1)) \log y \log\log y})$. One can
reasonably guess that, compared to traditional ECM, “early-abort ECM” saves
a subexponential factor here, but the complexity of early-abort ECM has never
been analyzed. Section 3 of this paper introduces new early-abort heuristics
implying that the cost of early-abort ECM is
$\exp(\sqrt{(8/9 + o(1)) \log y \log\log y})$.
Using early aborts increases somewhat the number of auxiliary integers that need
to be factored, producing a further increase in cost, but the cost is outweighed
by the faster factorization.

The ECM cost is obviously bounded by $L^{o(1)}$: more precisely, the cost is
$\exp(\Theta((\log N)^{1/6}(\log\log N)^{5/6}))$ in the context of batch NFS, since
$y \in L^{\Theta(1)}$. This cost is invisible at the level of detail of
$L^{1.704\ldots+o(1)}$. The speedup from ECM
to early-abort ECM is nevertheless superpolynomial and directly translates into 
the same speedup in batch NFS. 

1.2. Security consequences. We again emphasize that our results are asymp- 
totic. This prevents us from directly drawing any conclusions about 1024-bit 
RSA, or 2048-bit RSA, or any other specific RSA key size. Our results are nev- 
ertheless sufficient to undermine all three of the beliefs described above: 

• Users comparing the value of an RSA key to the cost of an attack machine 
need to know the per-key cost of batch NFS. This has not been seriously 
studied. What the literature has actually studied in detail is the cost of NFS 
attacking one key at a time; this is not the same question. Our asymptotic
results do not rule out the possibility that these costs are the same for 
1024-bit RSA, but there is also no reason to be confident about any such 
possibility. 

• Most of the literature on single-key NFS relies heavily on operations that — 
for large key sizes — are not handled efficiently by current CPUs and that 
become much more efficient on ASICs: consider, for example, the routing 
circuit in [51]. Batch NFS relies much more heavily on massively parallel 
elliptic-curve scalar multiplication, exactly the operation that is shown in 
[12], [11], and [17] to fit very well into off-the-shelf graphics cards. The 
literature supports the view that off-the-shelf hardware is much less cost- 
effective than ASICs for single-key NFS, but there is no reason to think that
the same is true for batch NFS. 

• The natural machine size for batch NFS (i.e., the circuit area if price- 
performance ratio is optimized) is larger than the natural machine size for 
single-key NFS, but the natural time is considerably smaller. As above, these 
asymptotic results undermine any confidence that one can obtain from com- 
paring the natural time for single-key NFS to the rotation interval for sig- 
nature keys: there is no reason to think that the latency of batch NFS will 
be as large as the latency of single-key NFS. Note that, even though this pa- 
per emphasizes optimal price-performance ratio for simplicity, there are also 
techniques to further reduce the time below the natural time, hitting much 
lower latency targets without severely compromising price-performance ra- 
tio: in particular, for the core sorting subroutines inside linear algebra, one 
can replace time $T$ with $T/f$ at the expense of replacing area $A$ with $Af^2$.

The standard measure of security is the total cost of attacking one key. For 
example, this is what NIST is measuring in [6] when it reports “80-bit security” 
for 1024-bit RSA, “112-bit security” for 2048-bit RSA, “128-bit security” for
3072-bit RSA, etc. What batch NFS illustrates is that, when there are many 
user keys, the attacker’s cost per key can be smaller than the attacker’s total 
cost for one key. It is much more informative to measure the attacker’s total 
cost of attacking U user keys, as a function of U. It is even more informative 
to measure the attacker’s chance of breaking exactly K out of U simultaneously 
attacked keys in time T using a machine of cost A, as a function of (K, U, T, A).

There are many other examples of cryptosystems where the attack cost does
not grow linearly with the number of targets. For example, it is well known
that exhaustive search finds preimages for U hash outputs in about the same
time as a preimage for a single hash output; furthermore, the first preimage that
it finds appears after only 1/U of the total time, reducing actual security by
lg U bits. However, most cryptosystems have moved up to at least a “128-bit”
security level, giving them a buffer against losing some bits of security. RSA
is an exception: its poor performance at high security levels has kept it at a
bleeding-edge “80-bit security” level. Even when users can be convinced to move
away from 1024-bit keys, they normally move to ≤2048-bit keys. We question
whether it is appropriate to view 1024-bit keys as “80-bit” security and 2048-bit
keys as “112-bit” security if the attacker’s costs per key are not so high.

1.3. Previous work. In the NFS literature, as in the algorithm literature in 
general, there is a split between traditional analyses of “operations” (adding two 
64-bit integers is one “operation”; looking up an element of a $2^{64}$-byte array is one
“operation”) and modern analyses of more realistic models of computation. We 
follow the terminology of our paper [14]: the “RAM metric” counts traditional 
operations, while the “AT metric” multiplies the area of a circuit by the time 
taken by the same circuit. 

Buhler, H. Lenstra, and Pomerance showed in [19] (assuming standard NFS
heuristics, which we now stop mentioning) that NFS factors a single key $N$ with
RAM cost $L^{1.922\ldots+o(1)}$. As above, $L$ means
$\exp((\log 2^B)^{1/3}(\log\log 2^B)^{2/3})$ if $N$ has $B$ bits. This exponent
1.922... is the most frequently quoted cost exponent for NFS.

Coppersmith in [20] introduced two improvements to NFS. The first, “multiple
number fields”, reduces the exponent $1.922\ldots + o(1)$ to $1.901\ldots + o(1)$.
The second, the “factorization factory”, is a non-uniform algorithm that reduces
$1.901\ldots + o(1)$ to just $1.638\ldots + o(1)$. Recall that (size-)non-uniform
algorithms are free to perform arbitrary amounts of precomputation as functions of
the size of the input, i.e., the number of bits of $N$. A closer look shows that
Coppersmith’s precomputation costs $L^{2.006\ldots+o(1)}$, so if it is applied to more
than $L^{0.368\ldots+o(1)}$ inputs then the precomputation cost can quite reasonably be
ignored.

Essentially all of the subsequent NFS literature has consisted of analysis and
optimization of algorithms that cost $L^{1.922\ldots+o(1)}$, such as the algorithm of [19].
The ideas of [20] have been dismissed for three important reasons:

• The bottleneck in [19] is sieving, while the bottleneck in [20] is ECM. Both
of these algorithms use $L^{o(1)}$ operations in the RAM metric, but the $o(1)$ is
considerably smaller for sieving than for ECM.

• Even if the $o(1)$ in [20] were as small as the $o(1)$ in [19], there would not be
much benefit in $1.901\ldots + o(1)$ compared to $1.922\ldots + o(1)$. For example,
$(2^{50})^{1.922} \approx 2^{96}$ while $(2^{50})^{1.901} \approx 2^{95}$.

• The change from $1.901\ldots + o(1)$ to $1.638\ldots + o(1)$ is much larger, but it
comes at the cost of massive memory consumption. Specifically, [20] requires
space $L^{1.638\ldots+o(1)}$, while [19] uses space just $L^{0.961\ldots+o(1)}$. This is not
visible in the RAM metric but is obviously a huge problem in reality, and it
becomes increasingly severe as computations grow larger. As a concrete
illustration of the real-world costs of storage and computation, paying for
$2^{70}$ bytes of slow storage (about $30 \cdot 10^9$ USD in hard drives) is much more
troublesome than paying for $2^{80}$ floating-point multiplications (about
$0.02 \cdot 10^9$ USD in GPUs plus $0.005 \cdot 10^9$ USD for a year of electricity).

metric             exponent  precomp   batch  source
AT                 1.976...  0         0      2001 Bernstein [7]
RAM (unrealistic)  1.922...  0         0      1993 Buhler-H. Lenstra-Pomerance [19]
RAM (unrealistic)  1.901...  0         0      1993 Coppersmith [20]
AT                 1.900...  0         0.1    batch NFS; this paper
AT                 1.829...  0         0.2    batch NFS; this paper
AT                 1.763...  0         0.3    batch NFS; this paper
AT                 1.710...  0         0.4    batch NFS; this paper
AT                 1.704...  0         0.5    batch NFS; this paper
RAM (unrealistic)  1.638...  2.006...  0      1993 Coppersmith [20]

Table 1.4. Asymptotic exponents for several variants of NFS, assuming standard
heuristics. “Exponent” $e$ means asymptotic cost $L^{e+o(1)}$ per key factored.
“Precomp” $2\theta$ means that there is a precomputation involving integer pairs
$(a, b)$ up to $L^{\theta+o(1)}$, for total precomputation cost $L^{2\theta+o(1)}$;
algorithms without precomputation have $2\theta = 0$. “Batch” $\beta$ means batch size
$L^{\beta+o(1)}$; algorithms handling each key separately have $\beta = 0$. See
Section 2 for further details.

We quote A. Lenstra, H. Lenstra, Manasse, and Pollard [37]: “There is no
indication that the modification proposed by Coppersmith has any practical value.”
At the time there was already more than a decade of literature showing how to
analyze algorithm asymptotics in more realistic models of computation that
account for memory consumption, communication, etc.; see, e.g., [18]. Bernstein in
[7] analyzed the circuit performance of NFS, concluding that an optimized circuit
of area $L^{0.790\ldots+o(1)}$ would factor $N$ in time $L^{1.18\ldots+o(1)}$, for
price-performance ratio $L^{1.976\ldots+o(1)}$. [7] did not analyze the factorization
factory but did analyze multiple number fields, concluding that they did not
reduce AT cost. The gap between the RAM exponent $1.901\ldots + o(1)$ from [20] and
the AT exponent $1.976\ldots + o(1)$ from [7] is explained primarily by communication
overhead inside linear algebra, somewhat moderated by parameter choices that
reduce the cost of linear algebra at the expense of relation collection.

We pointed out in [14] that the factorization factory does not reduce AT 
cost. In Section 2 we review the reason for this and explain how batch NFS 
works around it. We also presented in [14] a superpolynomial improvement to 
the factorization factory in the RAM metric, by eliminating ECM in favor of 
batch trial division, but this is not useful in the AT metric.

2 Exponents

This section reviews NFS and then explains how to drastically reduce the AT
cost of NFS through batching. The resulting cost exponent, 1.704... in Table 1.4,
is new. All costs in this section are expressed as $L^{e+o(1)}$ for various
exponents $e$. Section 3 looks more closely at the $L^{o(1)}$ factor.

2.1. QS: the quadratic sieve (1982). As a warmup for NFS we briefly review 
the general idea of combining congruences, using QS as an example. 

QS writes down a large collection of congruences modulo the target integer 
N and tries to find a nontrivial subcollection whose product is a congruence of 
squares. One can then reasonably hope that the difference of square roots has a 
nontrivial factor in common with N. 

Specifically, QS computes $s \approx \sqrt{N}$ and writes down the congruences
$s^2 \equiv s^2 - N$, $(s + 1)^2 \equiv (s + 1)^2 - N$, etc. The left side of each
congruence is already a square. The main problem is to find a nontrivial set of
integers $a$ such that the product of $(s + a)^2 - N$ is a square.

If $(s + a)^2 - N$ is divisible by a very large prime then it is highly unlikely
to participate in a square: the prime would have to appear a second time. QS
therefore focuses on smooth congruences: congruences where $(s + a)^2 - N$ factors
completely into small primes. Applying linear algebra modulo 2 to the matrix
of exponents in these factorizations is guaranteed to find nonempty subsets of
the congruences with square product once the number of smooth congruences
exceeds the number of small primes.

The integers $a$ such that $(s + a)^2 - N$ is divisible by a prime $p$ form a small
number of arithmetic progressions modulo $p$. “Sieving” means jumping through
these arithmetic progressions to mark divisibility, the same way that the sieve
of Eratosthenes jumps through arithmetic progressions to mark non-primality.
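The combination-of-congruences idea can be made concrete with a toy Python sketch. It is a simplification under stated assumptions, not the algorithm analyzed in this paper: smoothness is tested by trial division instead of sieving, the linear algebra modulo 2 is done incrementally with integer bitmasks, and the names `factor_over_base` and `toy_qs` are ours.

```python
from math import gcd, isqrt

def factor_over_base(n, primes):
    """Exponent vector of n over primes, or None if n is not smooth."""
    exps = []
    for p in primes:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        exps.append(e)
    return exps if n == 1 else None

def toy_qs(N, primes, max_a=10000):
    """Try to split N using smooth values of (s + a)^2 - N."""
    s = isqrt(N) + 1                 # s is roughly sqrt(N), with s^2 > N
    relations = []                   # (s + a, exponent vector)
    basis = {}                       # pivot bit -> (parity mask, relation subset)
    for a in range(max_a):
        x = s + a
        exps = factor_over_base(x * x - N, primes)
        if exps is None:
            continue                 # not smooth: discard this congruence
        relations.append((x, exps))
        mask = sum((e & 1) << i for i, e in enumerate(exps))
        combo = 1 << (len(relations) - 1)
        while mask:                  # Gaussian elimination over GF(2)
            pivot = mask.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (mask, combo)
                break
            bmask, bcombo = basis[pivot]
            mask ^= bmask
            combo ^= bcombo
        if mask:
            continue                 # independent relation, no square yet
        # combo selects relations whose product is a square.
        X, total = 1, [0] * len(primes)
        for j, (xj, ej) in enumerate(relations):
            if (combo >> j) & 1:
                X = X * xj % N
                total = [t + e for t, e in zip(total, ej)]
        Y = 1
        for p, t in zip(primes, total):
            Y = Y * pow(p, t // 2, N) % N
        g = gcd(X - Y, N)            # X^2 = Y^2 (mod N)
        if 1 < g < N:
            return g
    return None
```

For example, with $N = 1649$ and the primes up to 13, the sketch combines $41^2 - 1649 = 32$ and $43^2 - 1649 = 200$, whose product $6400 = 80^2$ is a square, and recovers a factor from $\gcd(41 \cdot 43 - 80, 1649)$.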

2.2. NFS: the number-field sieve (1993). NFS applies the same idea, but
instead of congruences modulo $N$ it uses congruences modulo a related algebraic
number $m - \alpha$. This algebraic number is chosen to have norm $N$ (divided by
a certain denominator shown below), and one can reasonably hope to obtain a
factorization of $N$ by obtaining a random factorization of this algebraic number.

Specifically, NFS chooses a positive integer $m$, and writes $N$ as a polynomial
in radix $m$, namely $N = f(m)$ where $f$ is a degree-$d$ polynomial with coefficients
$f_d, f_{d-1}, \ldots, f_0 \in \{0, 1, \ldots, m - 1\}$. It is not difficult to see that
optimizing NFS requires $d$ to grow slowly with $N$, so $m$ is asymptotically on a
much smaller scale than $N$, although not as small as $L$. More precisely, NFS takes

$m \in \exp((\mu + o(1))(\log N)^{2/3}(\log\log N)^{1/3})$

where $\mu$ is a positive real constant, optimized below. Note that the inequalities
$m^d < N < m^{d+1}$ imply

$d \in (1/\mu + o(1))(\log N)^{1/3}(\log\log N)^{-1/3}$.

If $f$ is reducible then its factorization is easy to compute and (for $N$ reasonably
large compared to $d$) reveals a nontrivial factorization of $N$ (see [36]), so assume
from now on that $f$ is irreducible. Define $\alpha$ as a root of $f$. The norm of
$a - b\alpha$ is then $f_d a^d + f_{d-1} a^{d-1} b + \cdots + f_0 b^d$ (divided by $f_d$), and
in particular the norm of $m - \alpha$ is $N$ (again divided by $f_d$).

NFS uses the congruences $a - bm \equiv a - b\alpha$ modulo $m - \alpha$. There are now
two numbers, $a - bm$ and $a - b\alpha$, that both need to be smooth. Smoothness
of the algebraic number $a - b\alpha$ is defined as smoothness of the (scaled) norm
$f_d a^d + f_{d-1} a^{d-1} b + \cdots + f_0 b^d$, and smoothness of an integer is defined as
having no prime divisors larger than $y$. Here $y \in L^{\gamma+o(1)}$ is another
parameter chosen by NFS; $\gamma > 1/(6\mu)$ is another real constant, optimized below.
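The objects just defined are easy to compute for small numbers. The following Python sketch is a toy illustration, not an NFS implementation: it writes $N$ in radix $m$, evaluates the scaled norm of $a - b\alpha$, and lists coprime pairs $(a, b)$ for which both $a - bm$ and the scaled norm are $y$-smooth. Smoothness is tested by naive trial division, and the helper names are ours.

```python
from math import gcd

def base_m_polynomial(N, m):
    """Coefficients [f_0, ..., f_d] with N = sum(f_i * m**i), 0 <= f_i < m."""
    coeffs = []
    while N:
        N, r = divmod(N, m)
        coeffs.append(r)
    return coeffs

def scaled_norm(coeffs, a, b):
    """f_d a^d + f_{d-1} a^{d-1} b + ... + f_0 b^d."""
    d = len(coeffs) - 1
    return sum(f * a**i * b**(d - i) for i, f in enumerate(coeffs))

def is_smooth(n, y):
    """True if the nonzero integer n has no prime divisor larger than y."""
    n = abs(n)
    if n == 0:
        return False
    p = 2
    while p <= y and n > 1:
        while n % p == 0:     # composite p never divides here: its prime
            n //= p           # factors were already removed
        p += 1
    return n == 1

def smooth_pairs(N, m, H, y):
    """Coprime (a, b) in [-H, H] x [1, H] with a - b*m and the norm y-smooth."""
    coeffs = base_m_polynomial(N, m)
    found = []
    for b in range(1, H + 1):
        for a in range(-H, H + 1):
            if gcd(a, b) != 1:
                continue
            if is_smooth(a - b * m, y) and is_smooth(scaled_norm(coeffs, a, b), y):
                found.append((a, b))
    return found
```

With the toy values $N = 1000003$ and $m = 100$ one gets $f = x^3 + 3$, and the pair $(a, b) = (1, 1)$ qualifies for $y = 11$ because $1 - 100 = -3^2 \cdot 11$ and the scaled norm is $1 + 3 = 4$.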

The range of pairs $(a, b)$ searched for smooth congruences is the set of coprime
integer pairs in the rectangle $[-H, H] \times [1, H]$. Here $H$ is chosen so that there
will be enough smooth congruences to produce squares at the end of the algorithm.
Standard heuristics state that $a - bm$ has smoothness probability
$L^{-\mu/(3\gamma)+o(1)}$ if $a$ and $b$ are on much smaller scales than $m$; in particular,
if $H \in L^{\theta+o(1)}$ for some positive real number $\theta$ then the number of
congruences with $a - bm$ smooth is $L^{\phi+o(1)}$ with $\phi = 2\theta - \mu/(3\gamma)$.
Standard heuristics also provide the simultaneous smoothness probability of
$a - bm$ and $a - b\alpha$, implying that to obtain enough smooth congruences one can
take $H \in L^{\theta+o(1)}$ with $\theta = (3\mu\gamma^2 + 2\mu^2)/(6\mu\gamma - 1)$
and $\phi = (18\mu\gamma^3 + 6\mu^2\gamma + \mu)/(18\mu\gamma^2 - 3\gamma)$. See, e.g., [19]. We
henceforth assume these formulas for $\theta$ and $\phi$ in terms of $\mu$ and $\gamma$.

2.3. RAM cost analysis (1993). Sieving for $y$-smoothness of $H^{2+o(1)}$
polynomial values uses $H^{2+o(1)}$ operations, provided that $y$ is bounded by
$H^{2+o(1)}$. The point here is that the pairs $(a, b)$ with congruences divisible by
$p$ form a small number of shifted lattices of determinant $p$, usually with basis
vectors of length $O(\sqrt{p})$, making it easy to find all the lattice points inside
the rectangle $[-H, H] \times [1, H]$. The number of operations is thus essentially the
number of points marked, and each point is marked just
$\sum_{p \le y} 1/p \approx \log\log y$ times.
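The estimate $\sum_{p \le y} 1/p \approx \log\log y$ can be checked numerically. The sketch below is an illustration with our own function names; it relies on Mertens’ second theorem, which says that the sum equals $\log\log y + M + o(1)$ with $M \approx 0.2615$, so the number of marks per point grows extremely slowly with $y$.

```python
import math

def primes_up_to(y):
    """All primes p <= y, by the sieve of Eratosthenes."""
    sieve = bytearray([1]) * (y + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(y) + 1):
        if sieve[p]:
            # Jump through the arithmetic progression p*p, p*p + p, ...
            sieve[p * p::p] = bytearray(len(range(p * p, y + 1, p)))
    return [p for p in range(2, y + 1) if sieve[p]]

def marks_per_point(y):
    """Sum of 1/p over primes p <= y: average number of marks per point."""
    return sum(1.0 / p for p in primes_up_to(y))

if __name__ == "__main__":
    for y in (10**3, 10**4, 10**5, 10**6):
        print(y, round(marks_per_point(y), 4),
              round(math.log(math.log(y)) + 0.2615, 4))
```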

Sparse techniques for linear algebra involve $y^{1+o(1)}$ matrix-vector
multiplications, each involving $y^{1+o(1)}$ operations, for a total of $y^{2+o(1)}$
operations. Other subroutines in NFS take negligible time, so the overall RAM cost
of NFS is

$L^{\max\{2\theta, 2\gamma\}+o(1)}$.

It is not difficult to see that the exponent $\max\{2\theta, 2\gamma\}$ achieves its
minimum value $(64/9)^{1/3} = 1.922\ldots$ with $\mu = (1/3)^{1/3} = 0.693\ldots$ and
$\theta = \gamma = (8/9)^{1/3} = 0.961\ldots$. This exponent 1.922... is the NFS exponent
from [19], and as mentioned earlier is the most frequently quoted NFS exponent. We
do not review the multiple-number-fields improvement to 1.901... from [20]; as far
as we know, multiple number fields do not improve any of the exponents analyzed
below.

2.4. AT cost analysis (2001). In the AT metric there is an important obstacle
to cost $H^{2+o(1)}$ for sieving: namely, communicating across area $H^{2+o(1)}$
takes time at least $H^{1+o(1)}$. One can efficiently split the sieving problem into
$H^{2+o(1)}/y^{1+o(1)}$ tasks, running one task after another on a smaller array of
size $y^{1+o(1)}$, but communicating across this array still takes time at least
$y^{0.5+o(1)}$, so AT is at least $H^{2+o(1)} y^{0.5+o(1)}$.

Fortunately, there is a much more efficient alternative to sieving: ECM,
explained in Appendix A. What matters in this section is that ECM tests
$y$-smoothness in time $y^{o(1)}$ on a circuit of area $y^{o(1)}$. A parallel array of
ECM units, each handling a separate number, tests $y$-smoothness of $H^{2+o(1)}$
polynomial values in time $H^{2+o(1)}/y^{1+o(1)}$ on a circuit of area $y^{1+o(1)}$,
achieving $AT = H^{2+o(1)}$.

Unfortunately, the same obstacle shows up again for linear algebra, and this
time there is no efficient alternative. Multiplying a sparse matrix by a vector
requires time $y^{0.5+o(1)}$ on a circuit of area $y^{1+o(1)}$, and must be repeated
$y^{1+o(1)}$ times. The overall AT cost of NFS is $L^{\max\{2\theta, 2.5\gamma\}+o(1)}$.

The exponent $\max\{2\theta, 2.5\gamma\}$ achieves its minimum value 1.976... with
$\mu = 0.702\ldots$, $\gamma = 0.790\ldots$, and $\theta = 0.988\ldots$. This exponent
1.976... is the NFS exponent from [7]. Notice that $\gamma$ is much smaller here than
it was in the RAM optimization: $y$ has been reduced to keep the cost of linear
algebra under control, but this also forced $\theta$ to increase.

2.5. The factorization factory (1993). Coppersmith in [20] precomputes
“tables which will be useful for factoring any integers in a large range . . . after
the precomputation, an individual integer can be factored in time L[1/3, 1.639]”,
i.e., $L^{1.639+o(1)}$.

Coppersmith’s table is simply the set of $(a, b)$ such that $a - bm$ is smooth. One
reuses $m$, and thus this table, for any integer $N$ between (e.g.) $m^d$ and $m^{d+1}$.

Coppersmith’s method to factor “an individual integer” is to test smoothness
of $a - b\alpha$ for each $(a, b)$ in the table. At this point Coppersmith has found the
same smooth congruences as conventional NFS, and continues with linear algebra
in the usual way.

Coppersmith uses ECM to test smoothness. The problem with sieving here 
is not efficiency, as in the (subsequent) paper [7], but functionality: sieving can 
handle polynomial values only at regularly spaced inputs, and the pairs (a, b) in 
this table are not regularly spaced. 

Recall that the size of this table is $L^{\phi+o(1)}$ with
$\phi = 2\theta - \mu/(3\gamma)$. ECM uses $L^{o(1)}$ operations per number, for a total
smoothness cost of $L^{\phi+o(1)}$, asymptotically a clear improvement over the
$L^{2\theta+o(1)}$ for conventional NFS.

The overall RAM cost of the factorization factory is
$L^{\max\{\phi, 2\gamma\}+o(1)}$. The exponent achieves its minimum value 1.638... with
$\mu = 0.905\ldots$, $\gamma = 0.819\ldots$, $\theta = 1.003\ldots$, and
$\phi = 1.638\ldots$. This is the exponent from [20].

The AT metric tells a completely different story, as we pointed out in [14].
The area required for the table is $L^{\phi+o(1)}$. This area is easy to reuse for very
fast parallel smoothness detection, finishing in time $L^{o(1)}$. Unfortunately,
collecting the smooth results then takes time $L^{0.5\phi+o(1)}$, for an AT cost of at
least $L^{\max\{1.5\phi, 2.5\gamma\}+o(1)}$, never mind the problem of matching the table
area with the linear-algebra area. The minimum exponent here is above 2.4.
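The four optimizations in Sections 2.3 through 2.5 can be reproduced approximately with a brute-force grid search over $(\mu, \gamma)$, using the formulas for $\theta$ and $\phi$ from Section 2.2. The Python sketch below is a numerical check, not a derivation: the grid bounds and step size are our assumptions, and the results are accurate only to roughly two decimal places.

```python
def theta(mu, gamma):
    """theta = (3 mu gamma^2 + 2 mu^2) / (6 mu gamma - 1)."""
    return (3 * mu * gamma**2 + 2 * mu**2) / (6 * mu * gamma - 1)

def phi(mu, gamma):
    """phi = 2 theta - mu / (3 gamma)."""
    return 2 * theta(mu, gamma) - mu / (3 * gamma)

COSTS = {
    "RAM, conventional NFS": lambda mu, g: max(2 * theta(mu, g), 2 * g),
    "AT, conventional NFS": lambda mu, g: max(2 * theta(mu, g), 2.5 * g),
    "RAM, factorization factory": lambda mu, g: max(phi(mu, g), 2 * g),
    "AT, factorization factory": lambda mu, g: max(1.5 * phi(mu, g), 2.5 * g),
}

def grid_minimum(cost, lo=0.5, hi=1.2, step=0.002):
    """Smallest cost(mu, gamma) on a square grid (bounds are assumptions)."""
    best = (float("inf"), None, None)
    n = round((hi - lo) / step)
    for i in range(n + 1):
        mu = lo + i * step
        for j in range(n + 1):
            g = lo + j * step
            if 6 * mu * g <= 1:      # required: gamma > 1/(6 mu)
                continue
            value = cost(mu, g)
            if value < best[0]:
                best = (value, mu, g)
    return best

if __name__ == "__main__":
    for name, cost in COSTS.items():
        value, mu, g = grid_minimum(cost)
        print(f"{name}: exponent ~ {value:.3f} at mu ~ {mu:.3f}, gamma ~ {g:.3f}")
```

The first three searches should land close to 1.922..., 1.976... and 1.638... respectively, and the last one stays above 2.4, in line with the text.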

2.6. Batch NFS (new). We drastically reduce AT cost by sharing work across
many $N$’s in a different way: we process a batch of $N$’s in parallel, rather than
performing precomputation to be used for one $N$ at a time. We dynamically
enumerate the pairs $(a, b)$ with $a - bm$ smooth, distribute each pair across all
the $N$’s in the batch, and remove each pair as soon as possible, rather than
storing a complete table of the pairs. To avoid excessive communication costs we
completely reorganize data in the middle of the computation: at the beginning
each $N$ is repeated many times to bring $N$ close to the pairs $(a, b)$, while at the
end the pairs $(a, b)$ relevant to each $N$ are moved much closer together. The rest
of this subsection presents the details of the algorithm.

Fig. 2.7. Relation-search mesh finding pairs $(a, b)$ where $a - bm$ is smooth. The
following exponents are optimized for factoring a batch of $L^{0.5+o(1)}$ $B$-bit
integers: The mesh has height $L^{0.25+o(1)}$, width $L^{0.25+o(1)}$, and area
$L^{0.5+o(1)}$. The mesh consists of $L^{0.5+o(1)}$ small parallel processors
(illustration contains 16). Each processor has area $L^{o(1)}$. Each processor knows
the same $m \in \exp((0.92115 + o(1))(\log 2^B)^{2/3}(\log\log 2^B)^{1/3})$. Each processor
generates its own $L^{0.200484+o(1)}$ pairs $(a, b)$, where $a$ and $b$ are bounded by
$L^{1.077242+o(1)}$. Each processor tests each of its own $a - bm$ for smoothness
using ECM, using smoothness bound $L^{0.681600+o(1)}$. Together the processors
generate $L^{0.700484+o(1)}$ separate pairs $(a, b)$, of which $L^{0.25+o(1)}$ have
$a - bm$ smooth.

Consider as input a batch of $L^{\beta+o(1)}$ simultaneous targets $N$ within the large
range described above. We require $\beta \le \min\{2\phi - 2\gamma, 4\theta - 2\phi\}$; if there
are more targets available at once then we actually process those targets in
batches of size $L^{\min\{2\phi-2\gamma, 4\theta-2\phi\}+o(1)}$, storing no data between runs.

Consider a square mesh of $L^{\beta+o(1)}$ small parallel processors. This mesh is
large enough to store all of the targets $N$. Use each processor in parallel to test
smoothness of $a - bm$ for $L^{2\theta-\phi-0.5\beta+o(1)}$ pairs $(a, b)$ using ECM; by
hypothesis $2\theta - \phi - 0.5\beta \ge 0$. The total number of pairs here is
$L^{2\theta-\phi+0.5\beta+o(1)}$. Each smoothness test takes time $L^{o(1)}$. Overall the
mesh takes time $L^{2\theta-\phi-0.5\beta+o(1)}$ and produces a total of $L^{0.5\beta+o(1)}$
pairs $(a, b)$ with $a - bm$ smooth, i.e., only $L^{o(1)}$ pairs for each column of the
mesh. See Figure 2.7.

Move these pairs to the top row of the mesh (spreading them evenly across
that row) by a standard sorting algorithm, say the Schnorr-Shamir algorithm
from [50], taking time $L^{0.5\beta+o(1)}$. Then broadcast each pair to its entire
column, taking time $L^{0.5\beta+o(1)}$. Actually, it will suffice for each pair to
appear once somewhere in the first two rows, once somewhere in the next two rows,
etc.

Fig. 2.8. Relation-search mesh from Figure 2.7, now finding pairs $(a, b)$ where both
$a - bm$ and $a - b\alpha_i$ are smooth. For a batch of $L^{0.5+o(1)}$ $B$-bit integers: The
mesh knows $L^{0.25+o(1)}$ pairs $(a, b)$ with $a - bm$ smooth from Figure 2.7. Each
$(a, b)$ is copied $L^{0.25+o(1)}$ times (2 times in the illustration) so that it appears
in the first two rows, the next two rows, etc. Each $(a, b)$ visits each mesh position
within $L^{0.25+o(1)}$ steps (8 steps in the illustration). Each processor knows its own
target $N_i$ and the corresponding $\alpha_i$, and in each step tests each $a - b\alpha_i$ for
smoothness using ECM. Together Figure 2.7 and Figure 2.8 take time $L^{0.25+o(1)}$ to
search $L^{0.700484+o(1)}$ pairs $(a, b)$.

Now consider a pair at the top-left corner. Send this pair to its right until it
reaches the rightmost column, then down one row, then repeatedly to its left,
then back up. In parallel move all the other elements in the first two rows on
the same path. In parallel do the same for the third and fourth rows, the fifth
and sixth rows, etc. Overall this takes time $L^{0.5\beta+o(1)}$.

Observe that each pair has now visited each position in the mesh. When a
pair $(a, b)$ visits a mesh position holding a target $N$, use ECM to check whether
$a - b\alpha$ is smooth, taking time $L^{o(1)}$. The total time to check all
$L^{0.5\beta+o(1)}$ pairs against all $L^{\beta+o(1)}$ targets is just $L^{0.5\beta+o(1)}$, plus
the time $L^{2\theta-\phi-0.5\beta+o(1)}$ to generate the pairs in the first place. See
Figure 2.8.

Repeat this entire procedure $L^{\phi-\gamma-0.5\beta+o(1)}$ times; by hypothesis
$\phi-\gamma-0.5\beta > 0$. This covers a total of $L^{2\theta-\gamma+o(1)}$ pairs $(a,b)$, of which
$L^{\phi-\gamma+o(1)}$ have $a-bm$ smooth, so for each $N$ there are $L^{o(1)}$ pairs $(a,b)$ for which
$a-bm$ and $a-b\alpha$ are both smooth. The total number of smooth congruences found this way
across all $N$ is $L^{\beta+o(1)}$. Store each smooth congruence as $(N,a,b)$; all of these
together fit into a mesh of area $L^{\beta+o(1)}$. The time spent is $L^{\max\{\phi-\gamma,\,2\theta-\gamma-\beta\}+o(1)}$.

Build $L^{\gamma+o(1)}$ copies of the same mesh, all operating in parallel, for a total
circuit area of $L^{\beta+\gamma+o(1)}$. Each copy of the mesh has its own copy of the entire
list of $N$'s; distributing the $N$'s from an input port through the total circuit area
takes time $L^{0.5\beta+0.5\gamma+o(1)}$. The total circuit covers all $L^{2\theta+o(1)}$ pairs $(a,b)$ and
obtains, for each $N$, all of the $L^{\gamma+o(1)}$ smooth congruences required to factor
that $N$. See Figure 2.9.

Fig. 2.9. For a batch of $L^{0.5+o(1)}$ B-bit integers: $L^{0.681600+o(1)}$ copies (25 copies in
the illustration) of the mesh from Figure 2.7 and Figure 2.8. Each copy has the same
$L^{0.5+o(1)}$ target integers to factor. The total area of this circuit is $L^{1.181600+o(1)}$. In
time $L^{0.25+o(1)}$ this circuit searches $L^{1.382084+o(1)}$ pairs $(a,b)$. In time $L^{1.022400+o(1)}$
this circuit searches all $L^{2.154484+o(1)}$ pairs $(a,b)$ and finds, for each target $N_i$ and the
corresponding $\alpha_i$, all $L^{0.681600+o(1)}$ pairs $(a,b)$ for which $a-bm$ and $a-b\alpha_i$ are both
smooth.

We are not done yet: we still need to perform linear algebra for each N. 
To keep the communication costs of linear algebra under control we pack the 
linear algebra for each N into the smallest possible area. Allocate a separate 
square of area $L^{\gamma+o(1)}$ to each $N$, and route each smooth congruence $(N,a,b)$
in parallel to the corresponding square; this is another standard sorting step,
taking total time $L^{0.5\beta+0.5\gamma+o(1)}$ for all $L^{\beta+\gamma+o(1)}$ smooth congruences. Finally,
perform linear algebra separately in each square, and complete the factorization
of each $N$ as usual. This takes time $L^{1.5\gamma+o(1)}$. See Figure 2.10.

The overall time exponent is $\max\{\phi-\gamma,\ 2\theta-\gamma-\beta,\ 0.5\beta+0.5\gamma,\ 1.5\gamma\}$, and
the area exponent is $\beta+\gamma$. The final price-performance ratio, AT per integer
factored, has exponent $\max\{\phi,\ 2\theta-\beta,\ 0.5\beta+1.5\gamma,\ 2.5\gamma\}$.
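As a sanity check on these formulas, the following Python sketch evaluates the three exponents for one parameter choice. The function name `batch_nfs_exponents` is hypothetical, and the inputs are assumed to be the values that the $\beta = 0.5$ row of Table 2.12 lists for $\gamma$, $\theta$, and $\phi$; all $o(1)$ terms are ignored.

```python
def batch_nfs_exponents(beta, gamma, theta, phi):
    """Evaluate the time, area, and AT-per-key exponents stated in the text."""
    time = max(phi - gamma,
               2 * theta - gamma - beta,
               0.5 * beta + 0.5 * gamma,
               1.5 * gamma)
    area = beta + gamma
    at_per_key = max(phi,
                     2 * theta - beta,
                     0.5 * beta + 1.5 * gamma,
                     2.5 * gamma)
    return time, area, at_per_key


# Assumed inputs: the beta = 0.5 row of Table 2.12.
time, area, at_per_key = batch_nfs_exponents(0.5, 0.681600, 1.077242, 1.704000)
print(round(time, 6), round(area, 6), round(at_per_key, 6))
```

Under these assumptions the results agree with the exponents 1.022400, 1.181600, and 1.704000 quoted in Figures 2.9 and 2.10, and the AT exponent per key equals the time exponent plus the area exponent minus $\beta$.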

2.11. Comparison and numerical parameter optimization. Bernstein's
AT exponent from [7] was $\max\{2\theta, 2.5\gamma\}$. Batch NFS replaces $2\theta$ with $2\theta-\beta$,
allowing $\gamma$ to be correspondingly reduced, at least until $\beta$ becomes large enough
for $2.5\gamma$ to cross below $\phi$. In principle one should also watch for $2.5\gamma$ to cross
below $0.5\beta+1.5\gamma$, but Table 2.12 shows that $\phi$ is more important.

Fig. 2.10. For a batch of $L^{0.5+o(1)}$ B-bit integers: $L^{0.5+o(1)}$ copies (16 copies in the
illustration) of a linear-algebra circuit. Each circuit has area $L^{0.681600+o(1)}$. The total area is
$L^{1.181600+o(1)}$. Each circuit has its own integer $N_i$ to factor and $L^{0.681600+o(1)}$ pairs $(a,b)$
for which $a-bm$ and $a-b\alpha_i$ are smooth. Routing all pairs $(a,b)$ from Figure 2.9 to an
adjacent (or overlapping and reconfigured) Figure 2.10 takes time $L^{0.590800+o(1)}$. Each
circuit uses $L^{0.681600+o(1)}$ matrix-vector multiplications, and takes time $L^{0.340800+o(1)}$
for each matrix-vector multiplication. The total time is $L^{1.022400+o(1)}$.



Of course, even if we ignore the cost of finding the smooth $a-bm$ (the term
$2\theta-\beta$), our AT exponent is not as small as Coppersmith's RAM exponent
$\max\{\phi, 2\gamma\}$ from [20]. We have an extra $0.5\beta+1.5\gamma$ term, reflecting the cost of
communicating smooth congruences across a batch, and, more importantly, $2.5\gamma$
instead of $2\gamma$, reflecting the communication cost of linear algebra.

Table 2.12 shows the smallest exponents that we obtained for various $\beta$, in
each case from a brief search through 2500000000 pairs $(\mu,\gamma)$. The exponent
of the price-performance ratio for batch NFS drops below Bernstein's 1.976...
as soon as $\beta$ increases past 0, and reaches a minimum of 1.704... as the batch
size increases. (The minimum is actually very slightly below 1.704, but our table
does not include enough precision to show this.) Finding all $(a,b)$ with $a-bm$
smooth is still a slight bottleneck for $\beta = 0.4$ but disappears for $\beta = 0.5$. When
there are more inputs we partition them into batches of size $L^{0.5+o(1)}$, preserving
exponent 1.704... for the price-performance ratio.

Our optimal $\gamma = 0.681\ldots$ is much smaller than Coppersmith's $\gamma = 0.819\ldots$,
for the same reasons that Bernstein's $\gamma = 0.790\ldots$ is smaller than the
conventional $\gamma = 0.961\ldots$. The natural time exponent for batch NFS (as above,
this means the time exponent when price-performance ratio is optimized) is
just 1.022..., considerably smaller than the natural time exponent 1.185... for
single-key NFS. This means that collecting targets into batches produces not
merely a drastic improvement in price-performance ratio, but also a side effect
of considerably reducing latency.

  $\beta$  $e$        $\mu$      $\gamma$   $\theta$   $\phi$     $2\theta-\beta$  $0.5\beta+1.5\gamma$  $2.5\gamma$
  ...      1.763034   ...        ...        1.031517   1.723580   1.763034         1.207815              1.763025
  0.4      1.710375   0.820920   0.684150   1.055172   1.710374   1.710345         1.226225              1.710375
  0.5      1.704000   0.921150   0.681600   1.077242   1.704000   1.654484         1.272400              1.704000

Table 2.12. Cost exponents for batch NFS in the AT metric. The batch size
is $L^{\beta+o(1)}$. The AT cost is $L^{e+o(1)}$. The parameter $m$ is chosen as
$\exp((\mu+o(1))(\log N)^{2/3}(\log\log N)^{1/3})$. The prime bound $y$ is chosen as $L^{\gamma+o(1)}$. The $(a,b)$
bound $H$ is chosen as $L^{\theta+o(1)}$. The number of $a-b\alpha$ smoothness tests is $L^{\phi+o(1)}$
per target. The number of $a-bm$ smoothness tests is $L^{2\theta-\beta+o(1)}$ per target. The AT
cost of routing is $L^{0.5\beta+1.5\gamma+o(1)}$ per target. The AT cost of linear algebra is $L^{2.5\gamma+o(1)}$
per target. All operations take place on a circuit of size $L^{\beta+\gamma+o(1)}$.



3 Early-abort ECM 

Section 2 used ECM as a low-area smoothness test for auxiliary integers
$c = f_d a^d + \cdots + f_0 b^d$. Each curve in ECM catches a fraction of the primes $p \le y$
dividing $c$, and many curves in sequence catch essentially all of the primes $p \le y$.

This section analyzes a much faster smoothness-detection method, “early- 
abort ECM”. Not all smooth numbers are detected by early-abort ECM, but 
new heuristics introduced in this section imply that this loss is much smaller 
than the speedup factor. The overall improvement grows as a superpolynomial 
function of log y, and therefore grows as a superpolynomial function of the NFS 
input size. 

Specifically, it is well known (see, e.g., [21, page 302]) that (assuming standard
conjectures) ECM uses $\exp(\sqrt{(2+o(1))\log y\log\log y})$ multiplications modulo $c$
to find essentially all primes $p \le y$ dividing $c$. Here $o(1)$ is some function of $y$
that converges to 0 as $y \to \infty$. Consequently, if a fraction $1/S$ of the ECM inputs
are smooth, then ECM uses

$$S \cdot \exp\left(\sqrt{(2+o(1))\log y\log\log y}\right)$$

modular multiplications for each smooth integer that it finds. This section's
heuristics imply that early-abort ECM uses only

$$S \cdot \exp\left(\sqrt{(8/9+o(1))\log y\log\log y}\right)$$

modular multiplications for each smooth integer that it finds. Notice the change
from $2+o(1)$ to $8/9+o(1)$ in the exponent.
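To see what the change of constant means, the following Python sketch evaluates the natural logarithm of both cost expressions with the $o(1)$ terms simply dropped. This is an assumption made only for illustration: the function name `ecm_log_cost` is hypothetical, and the numbers say nothing about real costs for any particular $y$.

```python
import math


def ecm_log_cost(y, constant):
    """Natural log of exp(sqrt(constant * log y * log log y)), with o(1) dropped."""
    return math.sqrt(constant * math.log(y) * math.log(math.log(y)))


for bits in (32, 64, 128):
    y = 2.0 ** bits
    plain = ecm_log_cost(y, 2.0)        # ECM: constant 2
    early = ecm_log_cost(y, 8.0 / 9.0)  # early-abort ECM: constant 8/9
    print(bits, round(plain, 2), round(early, 2), round(early / plain, 4))
```

The ratio of the two logarithms is $\sqrt{(8/9)/2} = 2/3$ for every $y$, so in this idealized form the early-abort cost is the $2/3$ power of the ECM cost.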

We emphasize again that this paper's analyses are asymptotic. We do not
claim that early-abort ECM is better than ECM for any particular value of $y$.

The rest of this section uses the word "time" to count simple arithmetic
operations, such as multiplication and division, on integers with $O(\lg c)$ bits.
Each of these operations actually takes time $(\lg c)^{1+o(1)}$, but this extra factor is
absorbed into other $o(1)$ terms when $c$ is bounded by the usual functions of $y$.

3.1. Early-abort trial division. Early aborts predate ECM. They became 
popular in the 1970s as a component of CFRAC [43], a subexponential-time 
factorization method that, like batch NFS, generates many “random” numbers 
that need to be tested for smoothness. 

The simplest form of early aborts is single-early-abort trial division. Trial
division simply checks divisibility of $c$ by each prime $p \le y$, taking time $y^{1+o(1)}$.
Single-early-abort trial division first checks divisibility of $c$ by each prime
$p \le y^{1/2}$; then throws $c$ away (this is the early abort) if the unfactored part of $c$ is
too large; and then, if $c$ has survived the early abort, checks divisibility of $c$ by
each prime $p \le y$. (Of course, when checking each prime $p \le y$, one can skip the
redundant checks of primes $p \le y^{1/2}$.)

The definition of "too large" is chosen so that $1/y^{1/2+o(1)}$ of all inputs survive
the abort, balancing the cost of the stages before and after the abort. In other
words, single-early-abort trial division checks divisibility of each input by each
prime $p \le \sqrt{y}$; keeps the smallest $1/y^{1/2+o(1)}$ of all inputs; and, for each of those
inputs, checks divisibility by each prime $p \le y$.

More generally, $(k-1)$-early-abort trial division removes each prime $p \le y^{1/k}$
from each input (by dividing by factors found); reduces the number of inputs by
a factor of $y^{1/k}$, keeping the smallest inputs; removes each prime $p \le y^{2/k}$ from
each remaining input; reduces the number of inputs by another factor of $y^{1/k}$,
keeping the smallest inputs; and so on through $y^{k/k} = y$.

The time per input for $(k-1)$-early-abort trial division is only $y^{1/k+o(1)}$,
saving a factor $y^{1-1/k+o(1)}$, if $k$ is limited to a slowly growing function of $y$. The
method does not detect all smooth numbers, but Pomerance's analysis in [45,
Section 4] shows that the loss factor is only $y^{(1-1/k)/2+o(1)}$, i.e., that the method
detects 1 out of every $y^{(1-1/k)/2+o(1)}$ smooth numbers. The overall improvement
factor in price-performance ratio is $y^{(1-1/k)/2+o(1)}$; if $k$ is chosen so that $k \to \infty$
as $y \to \infty$ then the improvement factor is $y^{1/2+o(1)}$.
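The single-early-abort procedure described in this subsection can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method analyzed in [45]: the function names are hypothetical, and the number of survivors is passed explicitly as `keep` instead of being tuned to a fraction $1/y^{1/2+o(1)}$ of the inputs.

```python
from math import isqrt


def primes_upto(n):
    """Sieve of Eratosthenes: all primes <= n (n >= 2)."""
    sieve = bytearray([1]) * (n + 1)
    sieve[:2] = b"\x00\x00"
    for p in range(2, isqrt(n) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(n + 1) if sieve[p]]


def strip_primes(c, primes):
    """Divide out of c every power of every prime in the list."""
    for p in primes:
        while c % p == 0:
            c //= p
    return c


def single_early_abort(inputs, y, keep):
    """Return the inputs recognised as y-smooth.

    Stage 1 removes primes <= sqrt(y). Only the `keep` inputs with the
    smallest unfactored part survive the abort and reach stage 2, which
    removes the remaining primes <= y.
    """
    cut = isqrt(y)
    small = primes_upto(cut)
    large = [p for p in primes_upto(y) if p > cut]
    stage1 = sorted((strip_primes(c, small), c) for c in inputs)
    return [c for rest, c in stage1[:keep] if strip_primes(rest, large) == 1]


inputs = range(2, 2000)
everything = single_early_abort(inputs, 100, keep=len(inputs))  # no abort
aborted = single_early_abort(inputs, 100, keep=500)             # with abort
print(len(everything), len(aborted))
```

Every integer reported with the abort is also reported without it; the abort trades some missed smooth numbers for less stage-2 work.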

3.2. Early aborts in more generality. One can replace trial division with
any method, or combination of methods, of checking for primes $\le y^{1/k}$, primes
$\le y^{2/k}$, etc.

In particular, Pomerance considered an early-abort version of Pollard's rho
method. The original method takes time $y^{1/2+o(1)}$ to find all primes $p \le y$.
Early-abort rho takes time only $y^{1/(2k)+o(1)}$, and Pomerance's analysis shows that it
has a loss factor of only $y^{(1-1/k)/4+o(1)}$.

Pomerance actually considered a different method by Pollard and Strassen. 
The Pollard-Strassen method takes essentially the same amount of time as Pol- 
lard’s rho method, and has the advantage of a proof of speed without any con- 
jectures, but has the disadvantage of using much more memory. 

Pomerance's paper was published in 1982, so of course it did not analyze
the elliptic-curve method. After seeing early aborts improve trial division from
$y$ to $y^{1/2}$, and improve Pollard's rho method from $y^{1/2}$ to $y^{1/4}$, one might
guess that early aborts improve ECM from $\exp(\sqrt{(2+o(1))\log y\log\log y})$ to
$\exp((1/2)\sqrt{(2+o(1))\log y\log\log y})$, but our heuristics do not agree with this
guess.

3.3. Performance of early aborts. Recall that ECM takes time $T(y)^{1+o(1)}$
to find primes $p \le y$, where $T(y) = \exp(\sqrt{2\log y\log\log y})$. We actually consider,
in much more generality, any factorization method $M$ taking time $T(y)^{1+o(1)}$ to
find primes $p \le y$, where $T$ is any sufficiently smooth function.

Our early-abort heuristics state that the price-performance ratio of
$(k-1)$-early-abort $M$ is the geometric average

$$T(y^{1/k})^{1/k}\,T(y^{2/k})^{1/k}\,T(y^{3/k})^{1/k}\cdots T(y)^{1/k}$$

to the power $1+o(1)$. More generally, cutoffs $y_1, y_2, y_3, \ldots$ produce a geometric
average of $T(y_1), T(y_2), T(y_3), \ldots$ with weights $\log y_1$, $\log y_2 - \log y_1$,
$\log y_3 - \log y_2$, $\ldots$.

In particular, for any purely exponential $T(y) = y^c$, the price-performance
ratio is (aside from the $1+o(1)$ power)

$$\left(T(y^{1/k})T(y^{2/k})\cdots T(y^{(k-1)/k})T(y)\right)^{1/k} = \left(y^{c/k}y^{2c/k}\cdots y^{(k-1)c/k}y^{c}\right)^{1/k} = y^{c(k+1)/(2k)},$$

which converges to $y^{c/2} = T(y)^{1/2}$ as $k$ increases, matching Pomerance's
analyses of early-abort trial division and early-abort rho. More generally, if
$T(y) = \exp(C(\log y)^{1/f})$ then $T(y^{i/k}) = T(y)^{(i/k)^{1/f}}$ so

$$\left(T(y^{1/k})T(y^{2/k})\cdots T(y^{(k-1)/k})T(y)\right)^{1/k} = T(y)^{\sum_{i=1}^{k}(i/k)^{1/f}/k} \to T(y)^{f/(f+1)}.$$

To prove that $\sum_{i=1}^{k}(i/k)^{1/f}/k \to f/(f+1)$ as $k \to \infty$, observe that
$\sum_{i=1}^{k}(i/k)^{1/f}/k$ is within $1/k$ of $\int_0^1 z^{1/f}\,dz = f/(f+1)$. ECM is essentially the case
$f = 2$: the geometric average is $T(y)^{2/3+o(1)}$.
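The convergence of the averaged exponent is easy to check numerically. The Python sketch below, with the hypothetical name `average_exponent`, evaluates $\sum_{i=1}^{k}(i/k)^{1/f}/k$ for the purely exponential case $f = 1$ and the ECM-like case $f = 2$.

```python
def average_exponent(f, k):
    """(1/k) * sum over i = 1..k of (i/k)**(1/f).

    This is the exponent of T(y) in the geometric average of
    T(y^(1/k)), ..., T(y) when T(y) = exp(C * (log y)**(1/f)).
    """
    return sum((i / k) ** (1.0 / f) for i in range(1, k + 1)) / k


for k in (1, 2, 10, 1000):
    print(k, round(average_exponent(1, k), 4), round(average_exponent(2, k), 4))
```

For $f = 1$ the sum equals $(k+1)/(2k)$, which tends to $1/2$; for $f = 2$ the values approach $2/3$, staying within $1/k$ of the limit as stated above.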

3.4. Understanding the heuristics. Let $y$ and $u$ be real numbers larger than
1, define $x = y^u$, and define $S_0 = \{1, 2, \ldots, \lfloor x \rfloor\}$. Define $\Psi(x,y)$ as the number of
$y$-smooth integers in $S_0$. Then $\Psi(x,y)$ is approximately $x/u^u$. See [45, Theorem
2.1] for a precise statement. The same approximation is still valid for $\Psi(x,y,z)$,
the number of $y$-smooth integers in $S_0$ having no prime factor $\le z$, assuming
that $z \le y^{1-1/\log u}$; see [45, Theorem 2.2].
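For small parameters $\Psi(x,y)$ can be counted exactly and compared with $x/u^u$. The Python sketch below uses a hypothetical helper `count_smooth`; the approximation is asymptotic, so at these toy sizes only a rough agreement should be expected.

```python
import math


def count_smooth(x, y):
    """Exact Psi(x, y): how many integers in {1, ..., x} are y-smooth."""
    rest = list(range(x + 1))          # rest[n] = unfactored part of n
    for p in range(2, y + 1):
        if rest[p] == p:               # p untouched so far, hence prime
            for n in range(p, x + 1, p):
                while rest[n] % p == 0:
                    rest[n] //= p
    return sum(1 for n in range(1, x + 1) if rest[n] == 1)


x, y = 10 ** 5, 100
u = math.log(x) / math.log(y)
print("exact fraction:", count_smooth(x, y) / x)
print("u^(-u) estimate:", u ** (-u))
```

As a small check of the counting routine, there are 34 integers up to 100 with no prime factor above 5, and `count_smooth(100, 5)` returns 34.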

Let $k$ be a positive integer. Let $y_0, y_1, y_2, \ldots, y_k$ be real numbers with
$1 = y_0 < y_1 < y_2 < \cdots < y_k = y$. Let $x_1, x_2, \ldots, x_k$ be positive real numbers with
$x = x_1x_2\cdots x_k$. Define

$S_1 = \{c \in S_0 : c/(y_1\text{-smooth part of } c) \le x/x_1\}$;

$S_2 = \{c \in S_1 : c/(y_2\text{-smooth part of } c) \le x/(x_1x_2)\}$;

$\vdots$

$S_k = \{c \in S_{k-1} : c/(y_k\text{-smooth part of } c) \le x/(x_1x_2\cdots x_k)\}$.

Note that each element $c \in S_k$ is $y$-smooth, since $c$ divided by its $y$-smooth part
is bounded by $x/(x_1x_2\cdots x_k) = 1$.

Consider any vector $(s_1, s_2, \ldots, s_k)$ such that each $s_i$ is a $y_i$-smooth positive
integer $\le x_i$ having no prime factors $\le y_{i-1}$. For any such $(s_1, s_2, \ldots, s_k)$, the
product $c = s_1s_2\cdots s_k$ is a positive integer bounded by $x_1x_2\cdots x_k = x$, so
$c \in S_0$. Dividing $c$ by its $y_1$-smooth part produces $s_2\cdots s_k \le x/x_1$, so $c \in S_1$.
Similarly $c \in S_2$ and so on through $c \in S_k$.

The map from $(s_1, s_2, \ldots, s_k)$ to $s_1s_2\cdots s_k \in S_k$ is injective: the $y_1$-smooth
part of $s_1s_2\cdots s_k$ is exactly $s_1$, the $y_2$-smooth part is exactly $s_1s_2$, etc. Hence
$\#S_k$ is at least the number of such vectors $(s_1, s_2, \ldots, s_k)$, which is exactly
$\Psi(x_1,y_1,y_0)\Psi(x_2,y_2,y_1)\Psi(x_3,y_3,y_2)\cdots\Psi(x_k,y_k,y_{k-1})$. Pomerance's early-abort
analysis in [45] says, in some cases, that $\#S_k$ is not much larger than this. We
heuristically assume that this is true in more generality.

The approximation $\Psi(x_i,y_i,y_{i-1}) \approx x_i/u_i^{u_i}$, where $u_i = (\log x_i)/\log y_i$, now
implies that $\#S_k$ is approximately $x/(u_1^{u_1}\cdots u_k^{u_k})$. More generally, $\#S_i$ is
approximately $x/(u_1^{u_1}\cdots u_i^{u_i})$.

Write $T_i$ for the cost of finding the $y_i$-smooth part of an integer. The
early-abort factorization method, applied to a uniform random element of $S_0$, always
takes time $T_1$ to find primes $\le y_1$; with probability $\#S_1/\#S_0 \approx 1/u_1^{u_1}$ takes
additional time $T_2$ to find primes $\le y_2$; with probability $\#S_2/\#S_0 \approx 1/(u_1^{u_1}u_2^{u_2})$ takes
additional time $T_3$ to find primes $\le y_3$; and so on. With probability
$\#S_k/\#S_0 \approx 1/(u_1^{u_1}\cdots u_k^{u_k})$ an integer is $y$-smooth and survives all aborts.

Balancing the time for the early-abort stages, i.e., ensuring that each stage
takes time approximately $T_1$, requires choosing $x_1$ (depending on $y_1$) so that
$u_1^{u_1} \approx T_2/T_1$, choosing $x_2$ (depending on $y_2$) so that $u_2^{u_2} \approx T_3/T_2$, and so on
through choosing $x_{k-1}$ (depending on $y_{k-1}$) so that $u_{k-1}^{u_{k-1}} \approx T_k/T_{k-1}$. Then
$x_k$ is determined as $x/(x_1\cdots x_{k-1})$, and $u_k$ is determined as $(\log x_k)/\log y_k =
u - (\log x_1\cdots x_{k-1})/\log y = u - (\theta_1u_1 + \theta_2u_2 + \cdots + \theta_{k-1}u_{k-1})$ where
$\theta_i = (\log y_i)/\log y$.

As a special case (including the cases considered by Pomerance), if all $u_i$ are
in $u^{1+o(1)}$, then $T_{i+1}/T_i \approx u_i^{u_i}$ is a $1+o(1)$ power of $u^{u_i}$, so $u_1^{u_1}\cdots u_k^{u_k}$ is a
$1+o(1)$ power of $u^{u_1+\cdots+u_k} = u^{u+u_1(1-\theta_1)+\cdots+u_{k-1}(1-\theta_{k-1})}$, which is a $1+o(1)$
power of

$$u^u(T_2/T_1)^{1-\theta_1}(T_3/T_2)^{1-\theta_2}\cdots(T_k/T_{k-1})^{1-\theta_{k-1}}
= u^u\,T_1^{\theta_1}T_2^{\theta_2-\theta_1}T_3^{\theta_3-\theta_2}\cdots T_k^{\theta_k-\theta_{k-1}}/T_1.$$

In other words, compared to the original smoothness probability $1/u^u$ of integers
in $S_0$, the found-by-early-abort-factorization probability is smaller by a factor
$T_1^{\theta_1}T_2^{\theta_2-\theta_1}\cdots T_k^{\theta_k-\theta_{k-1}}/T_1$, up to the same $1+o(1)$ power.
The time for all stages of early-abort factorization is essentially $T_1$. For example,
for $\theta_i = i/k$, the product of the time and the loss factor is $(T_1T_2\cdots T_k)^{1/k}$.

We see two obstacles to proving the formula $(T_1T_2\cdots T_k)^{1/k}$ for early-abort
ECM. First, the assumption $u_i \in u^{1+o(1)}$ is correct for exponential-time
smoothness tests for standard ranges of $x$ and $y$; but $u_i \in u^{0.5+o(1)}$ for ECM, except for
$i = k$. Second, the error factor $u^{o(u)}$ in the standard $u^u$ approximation is larger
than the entire ECM running time. Despite these caveats we conjecture that the
heuristics apply beyond the case of exponential-time smoothness tests, and in
particular apply to early-abort ECM.

Even when smoothness theorems are available, one should not overstate the 
extent to which they constitute rigorous analyses of NFS. There is no proof 
that NFS congruences have similar smoothness probability to uniform random 
integers; this is one of the NFS heuristics. There is no proof that ECM finds 
all small primes at similar speed; this is another heuristic. As mentioned ear- 
lier, Pomerance’s analysis in [45] actually uses the provable Pollard-Strassen 
smoothness-detection method, and Bernstein’s batch trial-division method [8] is 
proven to run in polynomial time per input; but both of these methods perform 
poorly in the AT metric. Similarly, Pomerance proved in [45] that Dixon’s ran- 
dom squares have similar smoothness probability to uniform random integers; 
but Dixon’s method is much slower than NFS, and proving something similar 
about NFS is an open problem. 

3.5. Impact of early aborts on smoothness probabilities. Because early- 
abort ECM does not find all smooth values, it forces batch NFS to consider 
more pairs $(a,b)$, and therefore slightly larger pairs $(a,b)$. This increase means
that the auxiliary integers c are larger and less likely to be smooth. We conclude 
by showing that this effect does not eliminate the (heuristic) asymptotic gain 
produced by early aborts. 

Recall that the smoothness probability of $c$ is heuristically $1/v^v$, where $v$ is
the ratio of the number of bits in $(|f_d| + \cdots + |f_0|)H^d$ and the number of bits
in $y$. The derivative of $v$ with respect to $\log H$ is $d/\log y$, so the derivative of
$\log(v^v)$ with respect to $\log H$ is $d(1+\log v)/\log y \in 1/(3\gamma\mu) + o(1)$; here we
have used the asymptotics $d \in (1/\mu+o(1))(\log N)^{1/3}(\log\log N)^{-1/3}$,
$\log y \in (\gamma+o(1))(\log N)^{1/3}(\log\log N)^{2/3}$, and $\log v \in (1/3+o(1))\log\log N$.

Write $\delta = (2/3)/(2 - 1/(3\gamma\mu))$. Multiplying $H$ by a factor $T^{\delta+o(1)}$ means
multiplying the number of pairs $(a,b)$ by a factor $T^{2\delta+o(1)}$ and thus multiplying
the number of smoothness tests by a factor $T^{2\delta+o(1)}$. Meanwhile it multiplies
$v^v$ by a factor $T^{\delta/(3\gamma\mu)+o(1)}$, and thus multiplies the final number of smooth
congruences by a factor $T^{\delta(2-1/(3\gamma\mu))+o(1)} = T^{2/3+o(1)}$. Our heuristics state that
switching from ECM to early-abort ECM reduces the number of smooth
congruences found by a factor $T^{2/3+o(1)}$, producing just enough smooth congruences
for a successful factorization, while decreasing the cost of each smoothness test
by a factor $T^{1+o(1)}$. The overall speedup factor is $T^{1-2\delta+o(1)}$.

For example, [7] took $\gamma \approx 0.790420$ and $\mu \approx 0.702860$, so the speedup factor
is $T^{0.047\ldots+o(1)}$. As another example, batch NFS with $\beta = 0.5$ takes $\gamma \approx 0.681600$
and $\mu \approx 0.921150$, so the speedup factor is $T^{0.092\ldots+o(1)}$.
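The two example exponents can be reproduced directly from the formula for $\delta$. The following Python sketch uses the hypothetical name `early_abort_speedup` and ignores all $o(1)$ terms.

```python
def early_abort_speedup(gamma, mu):
    """Return (delta, 1 - 2*delta) with delta = (2/3) / (2 - 1/(3*gamma*mu))."""
    delta = (2.0 / 3.0) / (2.0 - 1.0 / (3.0 * gamma * mu))
    return delta, 1.0 - 2.0 * delta


# Parameter pairs quoted in the text: [7], and batch NFS with beta = 0.5.
for label, gamma, mu in (("[7]", 0.790420, 0.702860),
                         ("batch NFS", 0.681600, 0.921150)):
    delta, exponent = early_abort_speedup(gamma, mu)
    print(label, round(delta, 4), round(exponent, 4))
```

The second value printed in each row is the exponent of $T$ in the speedup factor, which should match 0.047... and 0.092... respectively.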
A ECM

The elliptic-curve method of factorization (ECM) was introduced by H. Lenstra 
in 1987 [39]; [57] gives a good overview of ECM and a description of the most 
popular ECM software, GMP-ECM. The use of Edwards curves in ECM was 
suggested by Bernstein, Birkner, Lange, and Peters in [10] and implemented for 
amd64 architectures; see http://eecm.cr.yp.to/mpfq.html. A fast implementation
on GPUs of the particularly efficient curves of [9] was presented at Asiacrypt
2012 by Bos and Kleinjung in [17]. This appendix gives a brief overview
of Edwards curves and ECM.

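A.1. Edwards curves. An Edwards curve [22] over a field with $2 \ne 0$ is given
by an equation of the form $x^2 + y^2 = 1 + dx^2y^2$, for some $d \notin \{0, 1\}$. The addition
law on an Edwards curve is given by

$$(x_1,y_1) + (x_2,y_2) = \left(\frac{x_1y_2 + y_1x_2}{1 + dx_1x_2y_1y_2},\ \frac{y_1y_2 - x_1x_2}{1 - dx_1x_2y_1y_2}\right).$$

The neutral element is $(0, 1)$; $(0, -1)$ has order 2; $(\pm 1, 0)$ have order 4.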
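The statements about the neutral element and the points of order 2 and 4 can be verified over a small prime field. The Python sketch below is an illustration only: the names are hypothetical, and it assumes the standard Edwards addition law, $p = 13$, and $d = 2$ (a non-square modulo 13, so that no denominator vanishes). The three-argument `pow` with exponent `-1` requires Python 3.8 or later.

```python
def edwards_add(P, Q, d, p):
    """Edwards addition on x^2 + y^2 = 1 + d*x^2*y^2 over the field F_p."""
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2 % p
    x3 = (x1 * y2 + y1 * x2) * pow((1 + t) % p, -1, p) % p
    y3 = (y1 * y2 - x1 * x2) * pow((1 - t) % p, -1, p) % p
    return (x3, y3)


def on_curve(P, d, p):
    """Check the curve equation for the point P."""
    x, y = P
    return (x * x + y * y - 1 - d * x * x * y * y) % p == 0


p, d = 13, 2   # toy parameters chosen for this sketch
points = [(x, y) for x in range(p) for y in range(p) if on_curve((x, y), d, p)]

neutral = (0, 1)
print(all(edwards_add(P, neutral, d, p) == P for P in points))   # neutral element
print(edwards_add((0, p - 1), (0, p - 1), d, p) == neutral)      # (0,-1) has order 2
print(edwards_add((1, 0), (1, 0), d, p) == (0, p - 1))           # 2*(1,0) = (0,-1)
```

Each line prints `True` when the corresponding property holds for this toy curve.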

We presented projective addition formulas for Edwards curves in [13].
Subsequently several papers generalized the curve shape to twisted Edwards curves
and improved the addition laws; the most efficient arithmetic is due to Hisil,
Wong, Carter, and Dawson in [27]. For an overview of the costs of elliptic-curve
arithmetic in various representations see http://hyperelliptic.org/EFD/.

A.2. The elliptic-curve method of factorization. ECM is a variant of the
$p-1$ method: it uses elliptic curves modulo $c$ instead of the multiplicative group
modulo $c$, where $c$ is the number to be factored.

The $p-1$ method picks a random positive integer $a < c$ and a number $s$ and
computes $\gcd\{a^s - 1, c\}$. In the $p-1$ method a factor $p$ of $c$ is found, meaning
that $p$ divides the gcd, if the order of $a$ modulo $p$ is a divisor of $s$. To make
this very likely for many factors of $c$, one chooses $s$ as a very smooth number:
typically $s = \operatorname{lcm}\{1, 2, 3, \ldots, B_1\}$ for some smoothness bound $B_1$.
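A minimal Python sketch of this computation follows. The function name `p_minus_1` is hypothetical, and the base is fixed to $a = 2$ for reproducibility where the text picks $a$ at random; `math.lcm` requires Python 3.9 or later.

```python
import math


def p_minus_1(c, B1, a=2):
    """Stage 1 of the p-1 method: gcd(a^s - 1, c) with s = lcm(1, ..., B1).

    Returns a nontrivial factor of c, or None if the gcd is 1 or c.
    """
    s = math.lcm(*range(1, B1 + 1))
    g = math.gcd(pow(a, s, c) - 1, c)
    return g if 1 < g < c else None


print(p_minus_1(299, 5))    # 299 = 13 * 23
print(p_minus_1(299, 11))
```

With $B_1 = 5$ we have $s = 60$; the order of 2 modulo 13 is 12, which divides 60, while the order of 2 modulo 23 is 11, which does not, so the first call finds 13. With $B_1 = 11$ both orders divide $s$, the gcd is 299 itself, and the call returns `None`.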

ECM replaces $a$ with a point $P$ on an elliptic curve modulo $c$, and replaces
exponentiation with scalar multiplication by $s$: it computes $R = [s]P$. We describe
ECM with the elliptic curve instantiated as an Edwards curve. The computations
on the curve modulo $c$ use the same addition law as over the rational numbers
$\mathbb{Q}$ where we reduce all results modulo $c$.

Let again $p$ be a factor of $c$ and assume that the discriminant of the curve
and all of the denominators in the coefficients of $E$ and $P$ are coprime with
$c$ (otherwise factors of $c$ would already be found). Define $P_p$ and $R_p$ as the
reductions of the points $P$ and $R$ respectively modulo $p$. If the order of $P_p$ is a
divisor of $s$ then $R_p = (0, 1)$, so $\gcd\{x(R), c\}$ is divisible by $p$, where $x(R)$ denotes
the $x$-coordinate of $R$. This part of the computation is usually called stage 1 of
ECM. Stage 2 pushes the limit of the largest prime in $\operatorname{ord}(P_p)$ that can be
found with the method; a simple form of stage 2 computes $R_1 = [p_{k+1}]R$,
$R_2 = [p_{k+2}]R$, $\ldots$, $R_\ell = [p_{k+\ell}]R$, where $p_{k+1}, p_{k+2}, \ldots, p_{k+\ell}$ are the primes between
$B_1$ and another bound $B_2$, followed by computing $\gcd\{x(R_1)x(R_2)\cdots x(R_\ell), c\}$.

Efficient implementations use windowing methods with precomputations to
compute $[s]P$ and compute the $R_i$ in succession by first computing the differences
between successive values $[p_{i+1} - p_i]R$, for $i \in \{k+1, \ldots, k+\ell-1\}$, namely

The main advantage of ECM over the $p-1$ method is that the curve can be
varied; per number $c$ many curves can be tried; this increases the probability
that the order of one of the points $P$ is smooth modulo one of the factors $p$. The
$p-1$ method is limited to the one group of order $p-1$; if $p-1$ is not smooth
for any factor of $c$ this method will not succeed, while it is impossible to have
primes $p$ so that the order of any elliptic curve modulo $p$ has only large factors.
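Stage 1 of ECM with varying curves can be sketched as follows. This is a toy illustration under stated assumptions rather than an efficient implementation: all names are hypothetical, the curves are affine Edwards curves obtained by picking a random point $(x, y)$ modulo $c$ and solving the curve equation for $d$, and a denominator that is not invertible modulo $c$ is treated as a factor found early. It needs Python 3.9 or later for `math.lcm`.

```python
import math
import random


def edwards_add(P, Q, d, c):
    """Affine Edwards addition modulo c.

    Returns (point, 1) on success. If a denominator shares a factor with c,
    returns (None, g) with g = gcd(denominator, c) > 1.
    """
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2 % c
    den_x, den_y = (1 + t) % c, (1 - t) % c
    for den in (den_x, den_y):
        g = math.gcd(den, c)
        if g != 1:
            return None, g
    x3 = (x1 * y2 + y1 * x2) * pow(den_x, -1, c) % c
    y3 = (y1 * y2 - x1 * x2) * pow(den_y, -1, c) % c
    return (x3, y3), 1


def scalar_mult(s, P, d, c):
    """Left-to-right double-and-add computation of [s]P modulo c."""
    R = (0, 1)  # neutral element
    for bit in bin(s)[2:]:
        R, g = edwards_add(R, R, d, c)
        if R is None:
            return None, g
        if bit == "1":
            R, g = edwards_add(R, P, d, c)
            if R is None:
                return None, g
    return R, 1


def ecm_stage1(c, B1, curves=200, seed=1):
    """Try up to `curves` random Edwards curves; return a factor of c or None."""
    rng = random.Random(seed)
    s = math.lcm(*range(1, B1 + 1))
    for _ in range(curves):
        x, y = rng.randrange(2, c), rng.randrange(2, c)
        if math.gcd(x * y, c) != 1:
            continue  # degenerate starting point; try another curve
        # Choose d so that (x, y) lies on x^2 + y^2 = 1 + d*x^2*y^2 modulo c.
        d = (x * x + y * y - 1) * pow(x * x * y * y, -1, c) % c
        R, g = scalar_mult(s, (x, y), d, c)
        if R is not None:
            g = math.gcd(R[0], c)  # p divides x(R) when ord(P_p) divides s
        if 1 < g < c:
            return g
    return None


factor = ecm_stage1(10007 * 10009, B1=50)
print(factor)
```

Varying the curve is what the loop over `curves` expresses: a curve whose point order modulo $p$ does not divide $s$ is simply replaced by the next one.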

A.3. Choice of elliptic curve. The choice of elliptic curve is influenced by
the efficiency of scalar multiplication but also by the likelihood of leading to a
factorization. The factor $p$ is found if $\operatorname{ord}(P_p)$ is a factor of $s$ or of $s \cdot p_i$ for some
$i \in \{k+1, k+2, \ldots, k+\ell\}$. The order of $P_p$ divides the order of $E$ which is
in the Hasse interval around $p$, namely $[p + 1 - 2\sqrt{p},\ p + 1 + 2\sqrt{p}]$. McKee [42]
showed that orders of elliptic curves are more likely to be smooth than general
numbers of this size. Orders of Edwards curves have even larger smoothness
probability because they have a guaranteed cofactor of 4 in the group order. For
more details on smoothness chances see [10], [9], and [5].
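Both the Hasse interval and the cofactor 4 can be observed on small examples. The Python sketch below counts points by brute force; the names are hypothetical, and it assumes $d$ is a non-square modulo $p$, in which case the affine points of the Edwards curve form the whole group.

```python
def edwards_group_order(p, d):
    """Count affine points on x^2 + y^2 = 1 + d*x^2*y^2 over F_p by brute force."""
    return sum(1 for x in range(p) for y in range(p)
               if (x * x + y * y - 1 - d * x * x * y * y) % p == 0)


def smallest_nonsquare(p):
    """Smallest d in {2, ..., p-1} that is not a square modulo the odd prime p."""
    squares = {x * x % p for x in range(1, p)}
    return next(d for d in range(2, p) if d not in squares)


for p in (13, 101, 257):
    d = smallest_nonsquare(p)
    n = edwards_group_order(p, d)
    in_hasse_interval = (n - (p + 1)) ** 2 <= 4 * p
    print(p, d, n, n % 4 == 0, in_hasse_interval)
```

For each prime the last two columns report whether the group order is divisible by 4 and whether it lies in $[p + 1 - 2\sqrt{p},\ p + 1 + 2\sqrt{p}]$.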

It is possible to further increase the smoothness chances by choosing curves 
that have a large Q-rational torsion subgroup, i.e., for which the curve over 
Q has many points of finite order. Mazur’s theorem (see, e.g., [52]) limits the 
maximal number of points of finite order to 16. Atkin and Morain [3] gave a 
construction of curves attaining this maximum number of points of finite order 
which was translated to the setting of Edwards curves in [10]. This construction 
provides a family of curves $E$ and points $P$, where it is ensured that $P$ is none of
the points of finite order (otherwise $\gcd\{x(P), c\}$ would be divisible by $c$ itself).



